
Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation (2403.15356v2)

Published 22 Mar 2024 in cs.CV

Abstract: The development of foundation models has revolutionized our ability to interpret the Earth's surface using satellite observational data. Traditional models have been siloed, tailored to specific sensors or data types such as optical, radar, and hyperspectral imagery, each with its own characteristics. This specialization hinders holistic analyses that could benefit from the combined strengths of these diverse data sources. Our approach introduces the Dynamic One-For-All (DOFA) model, which draws on the concept of neural plasticity in brain science to adaptively integrate various data modalities into a single framework. At its core is a dynamic hypernetwork that adjusts to different wavelengths, enabling a single versatile Transformer, jointly trained on data from five sensors, to excel across 12 distinct Earth observation tasks, including on sensors never seen during pretraining. DOFA's design offers a promising step toward more accurate, efficient, and unified Earth observation analysis, showcasing remarkable adaptability and performance in harnessing multimodal Earth observation data.
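The core idea of conditioning a shared encoder on sensor wavelengths can be sketched in a few lines. Below is a minimal, illustrative numpy toy (not DOFA's actual implementation): a small hypernetwork maps each channel's central wavelength to a per-channel patch-embedding kernel, so images from sensors with different band counts land in the same token space. All names, dimensions, and the Fourier encoding are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_features(wavelengths, dim=16):
    # Encode each channel's central wavelength (micrometers) as sin/cos features.
    freqs = np.arange(1, dim // 2 + 1)
    ang = np.outer(np.asarray(wavelengths), freqs)      # (C, dim/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)  # (C, dim)

class WavelengthHypernet:
    """Toy hypernetwork: generates per-channel patch-embedding weights
    from channel wavelengths, so one model serves arbitrary sensors."""
    def __init__(self, enc_dim=16, patch=4, embed_dim=32):
        self.patch, self.embed_dim = patch, embed_dim
        # Hypernetwork parameters: wavelength encoding -> one channel's kernel.
        self.W = rng.normal(0, 0.02, (enc_dim, patch * patch * embed_dim))

    def embed(self, image, wavelengths):
        C, H, W = image.shape
        enc = fourier_features(wavelengths, self.W.shape[0])   # (C, enc_dim)
        kernels = (enc @ self.W).reshape(C, self.patch, self.patch, self.embed_dim)
        # Split into non-overlapping patches, project, and sum over channels,
        # producing sensor-agnostic tokens for a downstream Transformer.
        hp, wp = H // self.patch, W // self.patch
        patches = image.reshape(C, hp, self.patch, wp, self.patch)
        tokens = np.einsum('chpwq,cpqd->hwd', patches, kernels)
        return tokens.reshape(hp * wp, self.embed_dim)

# The same model handles a 4-band sensor and a 13-band sensor.
net = WavelengthHypernet()
rgbn = net.embed(rng.random((4, 16, 16)), [0.49, 0.56, 0.665, 0.842])
s2 = net.embed(rng.random((13, 16, 16)), np.linspace(0.44, 2.2, 13))
print(rgbn.shape, s2.shape)  # both (16, 32): a shared token space
```

The key property is that the backbone's input width no longer depends on the sensor's channel count, which is what lets one Transformer be trained jointly across modalities and applied to unseen sensors.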

arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Roy, D.P., Wulder, M.A., Loveland, T.R., Woodcock, C.E., Allen, R.G., Anderson, M.C., Helder, D., Irons, J.R., Johnson, D.M., Kennedy, R., et al.: Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment 145, 154–172 (2014) Drusch et al. 2012 Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al.: Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sensing of Environment 120, 25–36 (2012) Salomonson et al. 1989 Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989) Guanter et al. 2015 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 
2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989) Guanter et al. 2015 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 
2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 
2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Tong et al.
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. 
arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 
2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 
2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
  2. Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., Carvalhais, N., Prabhat: Deep learning and process understanding for data-driven Earth system science. Nature 566(7743), 195–204 (2019)
  3. Roy, D.P., Wulder, M.A., Loveland, T.R., Woodcock, C.E., Allen, R.G., Anderson, M.C., Helder, D., Irons, J.R., Johnson, D.M., Kennedy, R., et al.: Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment 145, 154–172 (2014)
  4. Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al.: Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sensing of Environment 120, 25–36 (2012)
  5. Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989)
  6. Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015)
  7. Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018)
  8. USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015)
  9. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
 10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
 11. Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
 12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
 13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
 14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
 15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
 16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
 17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
 18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
 19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
 20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
 21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
 22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
 23. Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
 24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
 25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
 26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
 27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
 28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
 29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
 30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
 31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
 32. Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: International Conference on Learning Representations (2017)
 33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
 34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
 35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
 36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
 37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
 38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
 39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
 40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
 41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
 42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
 43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
 44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
 45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
 46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
 47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
 48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
 49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
 50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
 51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
 52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
 53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
 54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
 55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
 56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
 57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
 58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
 59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
 60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
 61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
 62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
 63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
 64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
 65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
 66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
 67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
 68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
 69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
 70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
 71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
 72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
 73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
 74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
 75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
 76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
 77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
 78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
 79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
 80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
 81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
 82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Roy, D.P., Wulder, M.A., Loveland, T.R., Woodcock, C.E., Allen, R.G., Anderson, M.C., Helder, D., Irons, J.R., Johnson, D.M., Kennedy, R., et al.: Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment 145, 154–172 (2014) Drusch et al. 2012 Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al.: Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sensing of Environment 120, 25–36 (2012) Salomonson et al. 1989 Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989) Guanter et al. 2015 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 
2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989) Guanter et al. 
2015 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015) Huang et al. 2018 Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018) USDA Farm Service Agency (FSA) 2015 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. 
arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 
2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. 
Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024)
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
- Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
- Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
- Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
- Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023)
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al.
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
  3. Roy, D.P., Wulder, M.A., Loveland, T.R., Woodcock, C.E., Allen, R.G., Anderson, M.C., Helder, D., Irons, J.R., Johnson, D.M., Kennedy, R., et al.: Landsat-8: Science and product vision for terrestrial global change research. Remote Sensing of Environment 145, 154–172 (2014)
  4. Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B., Isola, C., Laberinti, P., Martimort, P., et al.: Sentinel-2: ESA's optical high-resolution mission for GMES operational services. Remote Sensing of Environment 120, 25–36 (2012)
  5. Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989)
  6. Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015)
  7. Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018)
  8. USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015)
  9. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
  10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
  11. Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
  12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
  13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
  14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
  15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
  16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
  17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
  18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
  19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
  20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
  21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
  23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
  24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
  25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
  26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
  27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
  28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
  30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
  31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
  32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
  33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
  36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean et al.
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al.
2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. 
Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision.
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 
2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. 
Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al.
2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits.
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: International Conference on Learning Representations (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 
2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023)
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning.
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023)
Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al.
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
5. Salomonson, V.V., Barnes, W., Maymon, P.W., Montgomery, H.E., Ostrow, H.: MODIS: Advanced facility instrument for studies of the Earth as a system. IEEE Transactions on Geoscience and Remote Sensing 27(2), 145–153 (1989)
6. Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015)
7. Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018)
8. USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015)
9. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
11. Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models.
arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al.
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al.
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al.
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al.
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al.
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al.
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks.
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
6. Guanter, L., Kaufmann, H., Segl, K., Foerster, S., Rogass, C., Chabrillat, S., Kuester, T., Hollstein, A., Rossner, G., Chlebek, C., et al.: The EnMAP spaceborne imaging spectroscopy mission for Earth observation. Remote Sensing 7(7), 8830–8857 (2015)
7. Huang, W., Sun, S., Jiang, H., Gao, C., Zong, X.: GF-2 satellite 1m/4m camera design and in-orbit commissioning. Chinese Journal of Electronics 27(6), 1316–1321 (2018)
8. USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015)
9. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
11. Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015) Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 
2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu et al. 2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al.
2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al.
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 
2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Braham, N.A.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017)
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE.
Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al.
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2017 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017) Schmitt et al. 2023 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. 
arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 
2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 
2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 
2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al.
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al.
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 
2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database.
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  8. USDA Farm Service Agency (FSA): National Agriculture Imagery Program (NAIP). USDA Geospatial Data Gateway (2015)
  9. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
  10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
  11. Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
  12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
  13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
  14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
  15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
  16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
  17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
  18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
  19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
  20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
  21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
  23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
  24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
  25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
  26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
  27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
  28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
  30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
  31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
  32. Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
  33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
  36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al.
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al.
2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022)
2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F.: Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5(4), 8–36 (2017)
Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023)
Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022)
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. 
arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. 
Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  10. Schmitt, M., Ahmadi, S.A., Xu, Y., Taşkın, G., Verma, U., Sica, F., Hänsch, R.: There are no data like more data: Datasets for deep learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine (2023) Xiong et al. 2022 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 
2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al.
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 
2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. 
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 
2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery.
Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision.
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need.
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 
2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al.
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Xiong, Z., Zhang, F., Wang, Y., Shi, Y., Zhu, X.X.: EarthNets: Empowering AI in Earth observation. arXiv preprint arXiv:2210.04936 (2022) Bommasani et al. 2021 Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021) Mendieta et al. 2023 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 2023 Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023) Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al.
2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al.
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization.
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al.
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 
2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022).
https://doi.org/10.1145/3557915.3560953 Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms.
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11), 2579–2605 (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring.
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 
2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11), 2579–2605 (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
12. Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al.
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023) Reed et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tang et al. 2024 Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. 
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 
2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Jean et al.
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
13. Mendieta, M., Han, B., Shi, X., Zhu, Y., Chen, C.: Towards geospatial foundation models via continual pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16806–16816 (2023)
14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. 
Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017)
Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU).
arXiv preprint arXiv:1803.08375 (2018)
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al.
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al.
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
14. Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T.: Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4088–4099 (2023)
15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024)
16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
23. Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023)
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  15. Tang, M., Cozma, A., Georgiou, K., Qi, H.: Cross-Scale MAE: A tale of multiscale exploitation in remote sensing. Advances in Neural Information Processing Systems 36 (2024) Wang et al. 2023 Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023) Cong et al. 2022 Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022) Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. 
Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision.
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need.
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). 
Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 
2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art.
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners.
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. 
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  16. Wang, Y., Hernández, H.H., Albrecht, C.M., Zhu, X.X.: Feature guided masked autoencoder for self-supervised learning in remote sensing. arXiv preprint arXiv:2310.18653 (2023)
  17. Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
  18. Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
  19. Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
  20. Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
  21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
  22. Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
  23. Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
  24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
  25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
  26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
  27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
  28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
  30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
  31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
  32. Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
  33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
  36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. 
Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. 
arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al.
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Wang et al.
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (ICLR) (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap Your Own Latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al.
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Kirillov et al.
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D., Ermon, S.: SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022)
Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024)
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart et al. 2024 Stewart, A., Lehmann, N., Corley, I., Wang, Y., Chang, Y.-C., Ait Ali Braham, N.A., Sehgal, S., Robinson, C., Banerjee, A.: SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36 (2024) Fuller et al. 2023 Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023) Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits.
Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al.
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al.
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al.
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit.
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al.
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 
2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021) Kirillov et al.
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023) Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 
2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. 
arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Fuller, A., Millard, K., Green, J.R.: CROMA: Remote sensing representations with contrastive radar-optical masked autoencoders. arXiv preprint arXiv:2311.00566 (2023)
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953
Wang et al. 2023 Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong et al. 2024 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al.
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al.
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. 
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery.
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Wang, Y., Albrecht, C.M., Braham, N.A.A., Liu, C., Xiong, Z., Zhu, X.X.: DeCUR: Decoupling common & unique representations for multimodal self-supervision. arXiv preprint arXiv:2309.05300 (2023)
Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al.
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 
Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023)
Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  21. Hong, D., Zhang, B., Li, X., Li, Y., Li, C., Yao, J., Yokoya, N., Li, H., Ghamisi, P., Jia, X., et al.: SpectralGPT: Spectral remote sensing foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024) Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. 
In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al.
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al.
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 
2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Bastani et al. 2023 Bastani, F., Wolters, P., Gupta, R., Ferdinando, J., Kembhavi, A.: SatlasPretrain: A large-scale dataset for remote sensing image understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16772–16782 (2023) Hebb 2005 Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005) Zucker and Regehr 2002 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009).
IEEE. Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002) Dan and Poo 2004 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 
2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms.
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  23. Hebb, D.O.: The organization of behavior: A neuropsychological theory (2005)
  24. Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
  25. Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
  26. Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
  27. Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
  28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
  30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
  31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
  32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
  33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
  36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. 
Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023)
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. 
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zucker, R.S., Regehr, W.G.: Short-term synaptic plasticity. Annual Review of Physiology 64(1), 355–405 (2002)
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks.
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004) Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 
2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 
2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods.
arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Brown et al.
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dan, Y., Poo, M.-m.: Spike timing-dependent plasticity of neural circuits. Neuron 44(1), 23–30 (2004)
Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008)
Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011)
Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Pittenger and Duman 2008 Pittenger, C., Duman, R.S.: Stress, depression, and neuroplasticity: A convergence of mechanisms. Neuropsychopharmacology 33(1), 88–109 (2008) Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data.
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 
2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. 
Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al.
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al.
2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 
2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. 
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Dayan and Cohen 2011 Dayan, E., Cohen, L.G.: Neuroplasticity subserving motor skill learning. Neuron 72(3), 443–454 (2011) Buckmaster et al. 2002 Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002) Duman and Duman 2015 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al.
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al.
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need.
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. 
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 
2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. 
Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  28. Buckmaster, P.S., Zhang, G.F., Yamawaki, R.: Axon sprouting in a model of temporal lobe epilepsy creates a predominantly excitatory feedback circuit. Journal of Neuroscience 22(15), 6650–6658 (2002)
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015)
  30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
  31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
  32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
  33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
  34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
  36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 
2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
  29. Duman, C.H., Duman, R.S.: Spine synapse remodeling in the pathophysiology and treatment of depression. Neuroscience Letters 601, 20–29 (2015) Lillicrap et al. 2020 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). 
Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. 
Nature Reviews Neuroscience 21(6), 335–346 (2020) Zhang et al. 2023 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023) Ha et al. 
2017 Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017) Deng et al. 2009 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction.
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. 
arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
30. Lillicrap, T.P., Santoro, A., Marris, L., Akerman, C.J., Hinton, G.: Backpropagation and the brain. Nature Reviews Neuroscience 21(6), 335–346 (2020)
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). Ieee Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. 
In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization.
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 
2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation.
In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. 
Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
31. Zhang, T., Cheng, X., Jia, S., Li, C.T., Poo, M.-m., Xu, B.: A brain-inspired algorithm that mitigates catastrophic forgetting of artificial and spiking neural networks with low computational cost. Science Advances 9(34), 2947 (2023)
32. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 
arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
32. Ha, D., Dai, A.M., Le, Q.V.: HyperNetworks. In: ICLR 2017 (2017)
33. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE
34. Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
36. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 
2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 
2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 
2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). IEEE Lacoste et al. 2023 Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023) Xiong et al. 2024 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art.
Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024) He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022) Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020) Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021) Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 
2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. 
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E.D., Kerner, H., Lütjens, B., Irvin, J.A., Dao, D., Alemohammad, H., Drouin, A., et al.: GEO-Bench: Toward foundation models for Earth monitoring. arXiv preprint arXiv:2306.03831 (2023)
Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  35. Xiong, Z., Wang, Y., Zhang, F., Zhu, X.X.: One for all: Toward unified foundation models for Earth vision. arXiv preprint arXiv:2401.07527 (2024)
He et al. 2022 He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy et al. 2020 Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks.
arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017)
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 
2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. 
Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers.
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 
2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 
2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 
2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 
10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  37. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  38. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022) Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018) Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Liu et al. 2021 Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu et al. 2022 Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Xiao et al. 2018 Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
Cheng et al. 2017 Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Van der Maaten and Hinton 2008 Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. 
arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
  39. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
  40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
  42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
  43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
40. Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434 (2018)
41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
43. Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  41. Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017) McInnes et al. 2018 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Van der Maaten and Hinton 2008 Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). 
arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 van der Maaten and Hinton 2008 van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008) Agarap 2018 Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018) Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) Zhang et al. 2017 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. 
Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. 
arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
42. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling.
IEEE Transactions on Geoscience and Remote Sensing (2023)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  43. Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(11) (2008)
  44. Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
  45. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 
2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 
2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation.
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017) Sung et al. 2018 Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018) Xiong et al. 2022 Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer Liu et al. 2024 Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024) Touvron et al. 2023 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  46. Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030 (2017)
  47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
  48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
  49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
  53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
  59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 
2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 
2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021)
47. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208 (2018)
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150. Springer (2022)
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
52. OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. arXiv preprint arXiv:2112.10752 (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 
2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. 
arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
48. Xiong, Z., Li, H., Zhu, X.X.: Doubly deformable aggregation of covariance matrices for few-shot segmentation. In: European Conference on Computer Vision, pp. 133–150 (2022). Springer
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data.
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023)
49. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2024)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  50. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) Brown et al. 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 
12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  51. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) OpenAI 2023 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. 
In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat Radford et al. 2021 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 
2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 
2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. 
In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment Anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap Your Own Latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
OpenAI: ChatGPT (June 26 version) [large language model] (2023). https://chat.openai.com/chat
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR Li et al. 2022 Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR Rombach et al. 2021 Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021) Kirillov et al. 2023 Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 
2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
53. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
  54. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900 (2022). PMLR
  55. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models (2021)
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 
2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. 
IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  56. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023) Grill et al. 2020 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. 
arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 
2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 
2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. 
arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  57. Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020) Caron et al. 2021 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021) Chen et al. 2020 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607 (2020). PMLR Frome et al. 2013 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. 
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. 
Advances in Neural Information Processing Systems 26 (2013) Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL '22), pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  58. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660 (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Wang et al.
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019) Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. 
arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022) Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 
2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. 
59. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. 
IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
  60. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Advances in Neural Information Processing Systems 26 (2013)
  61. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
  62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 
2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. 
ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
61. Ye et al. 2019 Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10502–10511 (2019)
Lu et al. 2022 Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou et al. 2023 Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. 
Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
62. Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-IO: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916 (2022)
Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023)
Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. 
arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 
2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. 
arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  63. Zou, X., Dou, Z.-Y., Yang, J., Gan, Z., Li, L., Li, C., Dai, X., Behl, H., Wang, J., Yuan, L., et al.: Generalized decoding for pixel, image, and language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15116–15127 (2023) Zhang et al. 2023 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023) Manas et al. 2021 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. 
arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021) Mall et al. 2023 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 
2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 
  64. Zhang, Y., Gong, K., Zhang, K., Li, H., Qiao, Y., Ouyang, W., Yue, X.: Meta-Transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802 (2023)
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023) Cha et al. 2023 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 
2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 
10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. 
arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. 
arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  65. Manas, O., Lacoste, A., Giró-i-Nieto, X., Vazquez, D., Rodriguez, P.: Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9414–9423 (2021)
  66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
66. Mall, U., Hariharan, B., Bala, K.: Change-aware sampling and contrastive learning for satellite images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5261–5270 (2023)
67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023)
68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 
2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. 
Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  67. Cha, K., Seo, J., Lee, T.: A billion-scale foundation model for remote sensing images. arXiv preprint arXiv:2304.05215 (2023) Yao et al. 2023 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. 
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023) Irvin et al. 
2023 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 
2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023) Wang et al. 2023 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023) Ayush et al. 2021 Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. 
In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021) Cepeda et al. 2023 Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023) Klemmer et al. 2023 Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023) Guo et al. 2023 Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023) Jean et al. 2019 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 
  68. Yao, F., Lu, W., Yang, H., Xu, L., Liu, C., Hu, L., Yu, H., Liu, N., Deng, C., Tang, D., et al.: RingMo-Sense: Remote sensing foundation model for spatiotemporal prediction via spatiotemporal evolution disentangling. IEEE Transactions on Geoscience and Remote Sensing (2023)
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019) Christie et al. 2018 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. 
In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? 
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  69. Irvin, J., Tao, L., Zhou, J., Ma, Y., Nashold, L., Liu, B., Ng, A.Y.: USat: A unified self-supervised encoder for multi-sensor satellite imagery. arXiv preprint arXiv:2312.02199 (2023)
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: CLIP-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, SIGSPATIAL '22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  70. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in Earth observation. IEEE Geoscience and Remote Sensing Magazine 11(3), 98–106 (2023)
  71. Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D., Ermon, S.: Geography-aware self-supervised learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10181–10190 (2021)
Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018) Wang et al. 2022 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022) Tong et al. 
2023 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023) Fuchs and Demir 2023 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). 
https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  72. Cepeda, V.V., Nayak, G.K., Shah, M.: GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization. arXiv preprint arXiv:2309.16020 (2023)
  73. Klemmer, K., Rolf, E., Robinson, C., Mackey, L., Rußwurm, M.: SatCLIP: Global, general-purpose location embeddings with satellite imagery. arXiv preprint arXiv:2311.17179 (2023)
data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023) Steiner et al. 2021 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021) Loshchilov and Hutter 2017 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 
2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) Stewart et al. 2022 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953 Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953
  74. Guo, X., Lao, J., Dang, B., Zhang, Y., Yu, L., Ru, L., Zhong, L., Huang, Z., Wu, K., Hu, D., et al.: SkySense: A multi-modal remote sensing foundation model towards universal interpretation for Earth observation imagery. arXiv preprint arXiv:2312.10115 (2023)
  75. Jean, N., Wang, S., Samar, A., Azzari, G., Lobell, D., Ermon, S.: Tile2Vec: Unsupervised representation learning for spatially distributed data. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3967–3974 (2019)
  76. Christie, G., Fendley, N., Wilson, J., Mukherjee, R.: Functional map of the world. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6172–6180 (2018)
  77. Wang, Y., Braham, N.A.A., Xiong, Z., Liu, C., Albrecht, C.M., Zhu, X.X.: SSL4EO-S12: A large-scale multi-modal, multi-temporal dataset for self-supervised learning in Earth observation. arXiv preprint arXiv:2211.07044 (2022)
  78. Tong, X.-Y., Xia, G.-S., Zhu, X.X.: Enabling country-scale land cover mapping with meter-resolution satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing 196, 178–196 (2023)
  79. Fuchs, M.H.P., Demir, B.: HySpecNet-11k: A large-scale hyperspectral dataset for benchmarking learning-based hyperspectral image compression methods. arXiv preprint arXiv:2306.00385 (2023)
  80. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)
  81. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  82. Stewart, A.J., Robinson, C., Corley, I.A., Ortiz, A., Lavista Ferres, J.M., Banerjee, A.: TorchGeo: Deep learning with geospatial data. In: Proceedings of the 30th International Conference on Advances in Geographic Information Systems. SIGSPATIAL ’22, pp. 1–12. Association for Computing Machinery, Seattle, Washington (2022). https://doi.org/10.1145/3557915.3560953

Summary

  • The paper presents DOFA, a model that uses dynamic weight generation inspired by neural plasticity to unify multimodal Earth observation data processing.
  • It employs a hypernetwork conditioned on spectral wavelengths and a shared Transformer backbone to tailor processing for different sensor modalities.
  • Experimental evaluations across 13 tasks reveal DOFA’s swift convergence, increased accuracy, and robust performance on unseen sensor data.

Neural Plasticity-Inspired Foundation Model for Adaptive Earth Observation

Introduction

Integrating Earth observation (EO) data across diverse sensing modalities is a complex challenge in remote sensing and AI. Traditional models tend to specialize in data from specific sensors, limiting the potential for comprehensive analysis when fusing data types such as optical, radar, and hyperspectral imagery. This specialization comes at the expense of broader applicability and efficiency in processing multifaceted EO data. The paper introduces the Dynamic One-For-All (DOFA) model, inspired by neural plasticity mechanisms observed in the human brain. DOFA uses a dynamic hypernetwork, conditioned on the wavelengths of the input bands, to adjust network weights for each modality, yielding a single processing framework with remarkable adaptability across diverse EO applications.

Methodology

The core innovation in DOFA lies in its dynamic weight generation mechanism, tailored to accommodate the inherent diversity in EO data modalities. By inputting the central wavelengths of spectral bands, the model dynamically synthesizes network weights, facilitating specialized processing for each modality within a unified architecture. This approach draws inspiration from neural plasticity, reflecting the brain's capacity to adapt its neural connections in response to new experiences. The model employs:

  • Hypernetworks that dynamically generate weights based on input wavelengths, allowing custom-tailored processing for different data types.
  • A shared Transformer backbone that serves as a universal feature extractor, learning modality-agnostic representations beneficial for a wide range of downstream tasks.
  • A masked image modeling (MIM) strategy for self-supervised pretraining, coupled with a distillation loss to enhance learning efficiency and model performance.
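The wavelength-conditioned weight generation described above can be sketched as follows. This is a minimal NumPy illustration with made-up layer sizes and a simple sinusoidal wavelength encoding; the paper's actual hypernetwork, encoding, and Transformer backbone are far larger and jointly trained, so treat every name and dimension here as an assumption.

```python
import numpy as np

def fourier_encode(wavelengths_um, dim=16):
    """Encode central wavelengths (micrometers) with sinusoidal features so the
    hypernetwork can condition on where each band sits in the spectrum."""
    freqs = np.exp(np.linspace(0.0, 4.0, dim // 2))   # log-spaced frequencies
    angles = np.outer(wavelengths_um, freqs)           # (bands, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (bands, dim)

class WavelengthHypernet:
    """Tiny MLP hypernetwork: maps a band's wavelength encoding to that band's
    patch-embedding weights. All sizes are illustrative, not the paper's."""
    def __init__(self, enc_dim=16, hidden=32, patch=4, embed_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.patch, self.embed_dim = patch, embed_dim
        out_dim = patch * patch * embed_dim            # one band's patch kernel
        self.w1 = rng.normal(0, 0.1, (enc_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, out_dim))

    def __call__(self, wavelengths_um):
        h = np.tanh(fourier_encode(wavelengths_um, self.w1.shape[0]) @ self.w1)
        w = h @ self.w2                                # (bands, patch*patch*embed_dim)
        return w.reshape(len(wavelengths_um), self.patch, self.patch, self.embed_dim)

def embed_patches(image, kernels):
    """Project non-overlapping patches of an image with any number of bands into
    fixed-size tokens by summing per-band contributions."""
    bands, H, W = image.shape
    p, d = kernels.shape[1], kernels.shape[3]
    tokens = np.zeros((H // p, W // p, d))
    for i in range(H // p):
        for j in range(W // p):
            patch = image[:, i * p:(i + 1) * p, j * p:(j + 1) * p]  # (bands, p, p)
            tokens[i, j] = np.einsum('bij,bijd->d', patch, kernels)
    return tokens.reshape(-1, d)                       # (num_tokens, embed_dim)

hyper = WavelengthHypernet()
# RGB-like bands vs. a hyperspectral stack: same model, different band counts,
# yet both map to tokens of the same shape for the shared backbone.
rgb = embed_patches(np.ones((3, 8, 8)), hyper(np.array([0.665, 0.560, 0.490])))
hsi = embed_patches(np.ones((10, 8, 8)), hyper(np.linspace(0.4, 2.5, 10)))
print(rgb.shape, hsi.shape)
```

The key property this sketch captures is that the band count never appears in the token shape: any sensor whose central wavelengths are known produces tokens the shared Transformer can consume.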

Experimental Results

The efficacy of DOFA is demonstrated through extensive evaluations across 13 distinct downstream tasks covering a wide array of EO applications. The model exhibits superior performance in most scenarios, outperforming existing state-of-the-art (SOTA) foundation models. These tasks encompass both classification and segmentation challenges, with DOFA achieving noteworthy results, particularly on sensors not encountered during the pretraining phase. Such versatility underscores DOFA's potential as a unified, adaptive foundation model for EO analysis. The reported experimental outcomes highlight DOFA's swift convergence and higher accuracy across various datasets, affirming its practical utility and adaptability in real-world applications.

Implications and Future Directions

The introduction of DOFA marks a significant stride towards realizing a unified, multimodal analysis framework in Earth observation. By harnessing the full spectrum of available EO data, this model paves the way for more nuanced and comprehensive environmental assessments. The practical implications of DOFA span across climate monitoring, disaster response, and sustainable development, showcasing the potential to discern intricate environmental processes through a singular, adaptive modeling approach.

Future research directions include extending DOFA's capabilities to encompass an even broader array of data types and exploring the integration of time-series analysis to capture dynamic environmental changes. Moreover, the model's foundational concept opens avenues for application beyond EO, potentially benefiting domains such as medical imaging, robotics, and climate modeling where multimodal data analysis is paramount.

In conclusion, DOFA emerges as a pioneering framework that adeptly navigates the complexity of multimodal EO data, offering a scalable, efficient solution to harnessing the wealth of information encapsulated in diverse sensing technologies. Its neural plasticity-inspired design not only advances the state-of-the-art in EO data analysis but also exemplifies the potential of drawing insights from biological systems to address computational challenges.
